Abstract

Due to delay and energy constraints, a cognitive radio may not be able to perform spectrum sensing
in all available channels. Therefore, a sensing policy is needed to decide which channels to sense. The
channel selection problem is the problem of designing such a sensing policy to maximize throughput
while avoiding interference to primary users. The channel selection problem can be formulated as a
reinforcement learning problem. Channel selection schemes that employ reinforcement learning
algorithms are vulnerable to belief manipulation attacks that contaminate the knowledge base of the
learning algorithms. In this paper, we analyze the security of channel selection algorithms that are based
on reinforcement learning against belief manipulation attacks.

I. Introduction

It is widely believed that cognitive radios (CRs) are one of the key technologies that can 
address the spectrum scarcity problem. It is expected that they will play an important role in 
maximizing spectrum utilization and help satisfy the QoS requirements of a number of important 
communications applications, from emergency first responders' public safety networks to military
tactical networks. CRs often employ software-defined radio platforms that are capable of
executing complex computational tasks to communicate efficiently without causing interference 
to licensed (a.k.a. primary) users. A specialized software module within a CR called the cognitive 
engine performs the aforementioned tasks, such as the optimization of the transmission/reception
(TX/RX) parameters and execution of spectrum sensing and spectrum access strategies. Most
of the tasks performed by a cognitive engine require the use of machine learning algorithms, 
especially if those tasks need to be carried out in a distributed manner. 

Considering the computing limitations and energy constraints of a battery-powered CR, a 
CR may not be able to perform full-spectrum sensing (i.e., sense all available spectrum bands) 
because of its prohibitive cost. Therefore, a spectrum sensing policy at the medium access control 
(MAC) layer is needed to decide which set of channels to sense. The channel selection problem 
is the problem of designing such a sensing policy. The optimal channel selection strategy for 
an unlicensed user (i.e., secondary user) is based on the availability statistics of the channels. 
The availability of the channels is determined by the presence/absence of primary user signals 
in those channels. The channels' availability statistics are initially unknown to a secondary 
user and need to be estimated using sensing samples. The critical tradeoff that the cognitive 
engine faces in each timeslot is between transmission ("exploitation") on the channel that has 
the highest expected reward (e.g., throughput) and channel sensing ("exploration") to get more 
information about the expected rewards of the other channels. The exploitation vs. exploration 
tradeoff problem, such as the one just described, is central to an area of machine learning known 
as reinforcement learning. 

Spectrum learning is the process of learning the spectrum statistics (i.e., primary user occupancy
information), which is crucial to enable CRs to sense/interpret their spectrum environment
and make intelligent decisions to achieve efficient communication. Although spectrum learning 
is beneficial for CRs, it can pose a serious security vulnerability. A radio that can learn has the 
potential to be taught by malicious entities in an adversarial environment. This kind of threat 
may have a long-lasting impact on the cognitive radio network. 

As mentioned above, the design of the optimal sensing policy can be formulated as a reinforcement
learning (RL) problem. When the channels are assumed to be independent, it can be
formulated as a special class of RL problems known as a restless multi-armed bandit process.
Recent results (e.g., [1]-[4]) show that a surprisingly simple myopic policy that ignores the
impact of the current action on the future reward is optimal when channels are identical. In
this paper, we show that in adversarial environments, where an active attacker performs belief
manipulation attacks against the machine learning algorithms executed on a cognitive engine, the
myopic policy is no longer optimal and a softmax policy that exploits some level of randomness
outperforms the myopic policy.

Our contributions can be summarized as follows: 

1) We derive closed-form expressions for the throughput of cognitive radios in an adversarial
environment for the two-channel case under two channel selection policies, viz., the myopic
policy and the softmax policy.

2) We derive the attacker's optimal attack strategy and the cognitive engine's optimal defense 
strategy by solving respective optimization problems for more than two non-identical 
channels. 

3) We identify and discuss two fundamental trade-offs in the security of spectrum learning:
(1) the attacker's trade-off between the attack probability and the number of required
observations for attack detection (i.e., the time that the attack detection system needs to
detect an attack); and (2) the channel selection system's trade-off between the attack resilience
and the performance of a given channel selection policy.

4) We prove that for sufficiently large attack probabilities, a softmax policy with a proper 
choice of parameters outperforms the myopic policy for all possible attacker's strategies. 

The rest of this paper is organized as follows. In Section II, the work related to this paper
is discussed. Section III provides the channel selection system model and the attack model.
In Section IV, we analyze sensing policies in an adversarial environment for a cognitive radio
system with two channels, and Section V extends the result to a cognitive radio system with
more than two channels. Finally, Section VI concludes the paper.



II. Related Work 

The formulation of the spectrum learning problem as a restless multi-armed bandit process is
investigated in [1]-[6]. It is proved that when channels are identical and independent, the myopic
policy is the optimal policy [2]-[4], and that this policy is a special case of Whittle's index policy
for the restless bandit problem, which can be computed for non-identical channels as well [5]. An
asymptotically optimal policy is proposed in [6] for a more realistic case where the policy does
not require any prior statistical knowledge about the traffic pattern and the channels are different.
Unlike these works, which assume a non-adversarial environment, this paper investigates the
spectrum learning problem in an adversarial environment where active attackers aim to reduce
the throughput of the cognitive radio network.

The security of machine learning algorithms that have been applied to applications such as
intrusion detection systems (IDS) and spam filters is investigated thoroughly in [7]-[11]. In
these papers, the authors discuss how an adversary can maliciously mistrain a learning system
in an IDS and how an attacker may contaminate the knowledge base of a spam email filtering
system to bypass the filtering. In [8] and [10], different kinds of attacks against machine learning
algorithms are introduced and a variety of potential defenses against those attacks are proposed.
In the context of cognitive radios, the security of machine learning algorithms that are used for
signal classification is addressed in [18] and [19], but the types of learning algorithms that are
analyzed are different from the algorithms in this paper. To the best of our knowledge, this paper
describes the first analysis of a reinforcement learning algorithm's vulnerability to belief
manipulation attacks in cognitive radios.

III. The Channel Selection System and the Attack Model 
In this section, we introduce the channel selection system and establish the attack model. 

A. Channel Selection System Model 

We consider a general dynamic spectrum access system where a user has access to $N$
independent and stochastically non-identical parallel Gilbert-Elliott channels [12], and chooses
one channel to sense and access in each time slot, aiming to maximize its expected long-term
reward (i.e., throughput).

As illustrated in Figure 1, the state of the $k$-th channel, either idle (1) or busy (0), indicates
whether the channel is unused by primary users or occupied. The transitions between these
two states follow a Markov chain with transition probabilities $\{p_{ij}^{(k)}\}_{i,j=0,1}$. We assume that
these transition probabilities for all channels are learned in a non-adversarial environment before
the system starts operating and thus the transition probabilities are known to the system. We
also assume that $p_{11}^{(k)} > p_{01}^{(k)}$ or, equivalently, that the channel states in two consecutive time slots
are positively correlated. Note that this assumption is only used for the derivation of closed-form
expressions and can easily be relaxed by separately considering the case where this assumption
does not hold. Due to its limited sensing and access capability, a secondary user chooses one
of the $N$ channels to sense and access in each slot. Designing an optimal sensing policy that
governs the channel selection at each time slot can be formulated as a restless multi-armed
bandit process for independent channels. We denote by $S_k(t)$ the state of channel $k$ in slot $t$, which
is given by the two-state Markov chain in Figure 1. Let $\mathbf{S}(t) = [S_1(t), \ldots, S_N(t)] \in \{0, 1\}^N$
denote the full system state.

Fig. 1. A Gilbert-Elliott channel.

The channel selection system is reward-based. At each time slot the secondary user selects one 
of the N channels to sense. If the sensed channel is occupied by primary user signals, the user 
collects no reward; otherwise it accesses the channel and collects one unit of reward. The system
keeps periodically sensing and transmitting on that channel until a primary user appears on the
channel or a jamming attack prevents the user from transmitting on the channel. The secondary
user's aim is to maximize the throughput (reward) over a horizon of T slots by choosing an 
optimal sensing policy. 

Due to limited sensing (i.e., sensing only one channel out of $N$ channels), the full system
state $\mathbf{S}(t)$ in slot $t$ is not observable. However, it has been shown that a sufficient statistic for
optimal decision making is given by the conditional probability that each channel is in state 1,
given all past observations and decisions [13]. Referred to as the belief vector, we denote this
sufficient statistic by $\Omega(t) = [\omega_1(t), \ldots, \omega_N(t)]$, where $\omega_k(t)$ is called the belief value of channel $k$,
which is equivalent to the conditional probability that $S_k(t) = 1$ given all past observations and
decisions for that channel. Given the sensing action $a(t) = k$ (the channel that is selected to
be sensed in slot $t$) and the observation $S_k(t)$ in slot $t$, the belief vector for slot $t + 1$ can be
updated via Bayes' rule through the following equation:

$$\omega_k(t+1) = \begin{cases} p_{11}^{(k)}, & a(t) = k,\ S_k(t) = 1 \\ p_{01}^{(k)}, & a(t) = k,\ S_k(t) = 0 \\ \mathcal{T}(\omega_k(t)), & a(t) \neq k \end{cases} \qquad (1)$$

where $\mathcal{T}(x) = x\,p_{11}^{(k)} + (1 - x)\,p_{01}^{(k)}$.

A sensing policy $\pi$ specifies a sequence of functions $\pi = [\pi_1, \ldots, \pi_T]$, where $\pi_t$ maps the belief
vector $\Omega(t)$ to a sensing action $a(t)$. Multi-channel opportunistic access can thus be formulated
as the following stochastic optimization problem:

$$\pi^* = \arg\max_{\pi} E_{\pi}\left[\sum_{t=1}^{T} R_{\pi_t(\Omega(t))}(t) \,\Big|\, \Omega(1)\right], \qquad (2)$$

where $\pi_t(\Omega(t))$ is the channel selected for sensing, $R_{\pi_t(\Omega(t))}(t)$ is the reward when the belief
vector is $\Omega(t)$ and the action $\pi_t(\Omega(t))$ is taken, and $\Omega(1)$ is the initial belief vector. If no
information about the initial system state is available, each entry of $\Omega(1)$ can be set to the
stationary distribution $\omega_0^{(k)}$ of the underlying Markov chain:

$$\omega_0^{(k)} = \frac{p_{01}^{(k)}}{p_{01}^{(k)} + p_{10}^{(k)}}. \qquad (3)$$
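The initialization in (3) and the belief update in (1) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the list-based representation of the belief vector, and the numeric transition probabilities are assumptions made for the example and do not come from the paper.

```python
from typing import List


def stationary_idle_prob(p01: float, p11: float) -> float:
    """Stationary probability of the idle state, as in (3): p01 / (p01 + p10)."""
    p10 = 1.0 - p11
    return p01 / (p01 + p10)


def update_belief(belief: List[float], p01: List[float], p11: List[float],
                  sensed: int, observed_idle: bool) -> List[float]:
    """One step of the belief update (1) after sensing channel `sensed`."""
    new_belief = []
    for k, w in enumerate(belief):
        if k == sensed:
            # The sensed channel's state is known, so its belief resets.
            new_belief.append(p11[k] if observed_idle else p01[k])
        else:
            # Unobserved channels evolve through the one-step operator T(x).
            new_belief.append(w * p11[k] + (1.0 - w) * p01[k])
    return new_belief


# Hypothetical two-channel example (values chosen only for illustration).
p01 = [0.2, 0.3]
p11 = [0.8, 0.7]
belief = [stationary_idle_prob(a, b) for a, b in zip(p01, p11)]
belief = update_belief(belief, p01, p11, sensed=0, observed_idle=True)
```

In this example both channels start at a belief of 0.5; after channel 0 is sensed idle its belief becomes $p_{11}^{(0)} = 0.8$, while the unobserved channel stays at its stationary value because the stationary distribution is a fixed point of $\mathcal{T}$.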

Let $V_t(\Omega(t))$ be the value function, which represents the maximum expected total reward that
can be obtained starting from slot $t$ given the current belief vector $\Omega(t)$. Given that the user
selects channel $k$ and observes $S_k(t)$ in slot $t$, the maximum expected reward consists of the
following two parts:

1) The expected immediate reward:

$$E[R_k(t)] = E[S_k(t)] = \omega_k(t).$$

2) The maximum expected future reward:

$$V_{t+1}(\mathcal{T}(\Omega(t) \mid k, S_k(t))),$$

where $\mathcal{T}(\Omega(t) \mid k, S_k(t))$ denotes the updated belief vector for slot $t + 1$ as given in (1). If
we maximize over all channel selections, we obtain the following optimization equation:

$$V_t(\Omega(t)) = \max_{k=1,\ldots,N} \left\{\omega_k(t) + V_{t+1}(\mathcal{T}(\Omega(t) \mid k, S_k(t)))\right\}.$$

Because the value function is limited to horizon $T$, we have $V_T(\Omega(T)) = \max_{k=1,\ldots,N} \omega_k(T)$
and $V_t(\Omega(t)) = 0$ for $t > T$. Theoretically, the optimal policy $\pi^*$ and its performance $V_1(\Omega(1))$
can be obtained by solving the above dynamic programming problem. However, because of the
impact of the current action on the future reward and the uncountable space of the belief vector,
obtaining the optimal solution via the above recursive equations is computationally prohibitive.

B. Attack Model 

We assume that there exists a single attacker in the environment who tries to decrease the 
throughput of secondary users by preventing them from transmission in some time slots, by 
employing adaptive interference techniques (or cognitive jamming attacks). Although many
defenses against these attacks have been proposed, none of these defenses is perfect or
can address all classes of attackers. While such defenses can restrict the instantaneous effect
of these attacks, they cannot reduce the long-term effect of such attacks on a cognitive radio
network when the network is using a learning system as part of its cognitive engine. These
attacks gradually contaminate the knowledge base of a cognitive radio and lead the learning
system to make wrong decisions, which degrades the performance of the radio.

The attacker uses the same equipment as secondary users, i.e., it is equipped with a CR 
that is comparable to a typical secondary user's CR in terms of battery capacity, transmission 
power, computing power, memory capacity, etc. Therefore, the attacker is power-limited and
wishes to avoid jamming continuously, which quickly drains power and causes fast detection by
the attack detection module of the cognitive engine. We also assume that the attacker can attack
only one channel at each time slot. Suppose that the attacker performs attacks in $t_a$ slots out of $T$
consecutive time slots on average. We denote $a = t_a / T$ as the attack probability in the environment.
The attacker controls the attack probability in order to cause maximal damage to the network in 
terms of throughput reduction. It can easily be seen that a higher attack probability will result 
in a lower throughput for the cognitive radio system. 

We assume that the adversary possesses full knowledge about the network and its parameters. 
We introduce the notion of the attacker's optimal strategy using the following definition: 

Definition 1: The attacker's a-optimal strategy is a strategy that minimizes the throughput 
of the target cognitive radio while keeping the attack probability fixed to value a. 

C. Attack Detection Model 

The cognitive radio network employs a mechanism for monitoring network status and detecting
potential malicious activity. This monitoring mechanism is proposed in [20] to protect the network
against jamming attacks. The monitoring can be done by specific monitor nodes in a distributed
network and a detection algorithm is employed by the detection module at a monitor node; it takes 
as input observation samples obtained by the monitor node (i.e., failed transmission/successful 
transmission) and decides whether there is an attack or not. On one hand the observation window 
should be small enough, such that the attack is detected in a timely manner and appropriate 
countermeasures are initiated. On the other hand, this window should be sufficiently large such 
that the chance of a false alarm or a mis-detection is reduced. 

The sequential nature of observations at consecutive time slots motivates the use of sequential 
detection techniques. A sequential decision rule is efficient if it can provide reliable decisions 
as fast as possible. There exists a trade-off between detection delay and detection accuracy in a 
detection scheme, i.e., a faster decision unavoidably leads to higher values of the probability of
false alarm $P_{FA}$ and the probability of mis-detection $P_M$, while lower values of these probabilities
are attained at the expense of detection delay. For given values of $P_{FA}$ and $P_M$, the detection test
that minimizes the average number of required observations (and thus the average delay) to reach
a decision among all sequential and non-sequential tests is Wald's Sequential Probability Ratio
Test (SPRT) [15].

In our case, the test is between hypotheses $H_0$ and $H_1$ with Bernoulli probability mass
functions (p.m.f.s) $f_0$ and $f_1$. Assume that $Y$ is a random variable with the Bernoulli distribution,
where $Y = 1$ denotes a failed transmission event in a slot. $H_0$ denotes the hypothesis that assumes
the absence of an attack, and because the probability of a failed transmission in the absence of
an attack is $p_{10}$ (i.e., $\Pr\{Y = 1\} = p_{10}$), the corresponding p.m.f. $f_0$ has a Bernoulli distribution with
parameter $\theta_0 = p_{10}$. Similarly, the probability of a failed transmission in the presence
of an attack with attack probability $a$ is $1 - p_{11}(1 - a)$ (i.e., $\Pr\{Y = 1\} = 1 - p_{11}(1 - a)$), so
$H_1$, the hypothesis that assumes the existence of an attack, has a Bernoulli p.m.f. $f_1$
with parameter $\theta_1 = 1 - p_{11}(1 - a)$.

The logarithm of the likelihood ratio at stage $k$ with observations $X_1, \ldots, X_k$ is:

$$S_k = \sum_{i=1}^{k} \ln\frac{f_1(X_i)}{f_0(X_i)},$$

where $X_i = 1$ if the system observes a failed transmission at time slot $i$, and $X_i = 0$ otherwise.
The decision rule is defined as follows, where $a$ and $b$ in the rule denote the upper and lower
decision thresholds of the test (not the attack probability):

$S_k \geq a \Rightarrow$ accept $H_1$

$S_k \leq b \Rightarrow$ accept $H_0$

$b < S_k < a \Rightarrow$ take another observation
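The sequential test described above can be sketched as follows. This is a minimal illustration, not the detector of [20]: the function name `sprt_detect` is hypothetical, and the thresholds are set with Wald's classical approximations, $\ln((1 - P_M)/P_{FA})$ for the upper threshold and $\ln(P_M/(1 - P_{FA}))$ for the lower one, which is an assumption since the paper does not state how the thresholds are chosen.

```python
import math
from typing import Iterable, Optional, Tuple


def sprt_detect(samples: Iterable[int], theta0: float, theta1: float,
                p_fa: float = 0.01, p_m: float = 0.01) -> Tuple[Optional[str], int]:
    """Run an SPRT on 0/1 samples (1 = failed transmission).

    Returns the accepted hypothesis ("H0", "H1", or None if undecided)
    and the number of samples consumed.
    """
    upper = math.log((1.0 - p_m) / p_fa)   # assumed Wald approximation
    lower = math.log(p_m / (1.0 - p_fa))   # assumed Wald approximation
    llr_fail = math.log(theta1 / theta0)                 # increment when X_i = 1
    llr_ok = math.log((1.0 - theta1) / (1.0 - theta0))   # increment when X_i = 0
    s = 0.0
    n = 0
    for n, x in enumerate(samples, start=1):
        s += llr_fail if x else llr_ok
        if s >= upper:
            return "H1", n
        if s <= lower:
            return "H0", n
    return None, n


# Hypothetical numbers: p11 = 0.8, so theta0 = p10 = 0.2; attack probability 0.5.
p11, attack_prob = 0.8, 0.5
theta0 = 1.0 - p11
theta1 = 1.0 - p11 * (1.0 - attack_prob)
decision_attack = sprt_detect([1] * 20, theta0, theta1)
decision_clean = sprt_detect([0] * 20, theta0, theta1)
```

With these illustrative values, a run of consecutive failed transmissions crosses the upper threshold after five samples, while a run of successes needs seven samples to accept $H_0$, which shows the delay asymmetry that the choice of $\theta_0$ and $\theta_1$ induces.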



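The analysis in [20] shows that using this SPRT, the average number of samples needed for
detecting an attack is:

$$E[N \mid H_1] = \frac{C}{\theta_1 \ln\frac{\theta_1}{\theta_0} + (1 - \theta_1)\ln\frac{1 - \theta_1}{1 - \theta_0}},$$

where $C$ is a fixed positive number. Using this equation for $E[N \mid H_1]$ and $\partial\theta_1 / \partial a = p_{11}$, we have:

$$\frac{\partial E[N \mid H_1]}{\partial a} = \frac{\partial E[N \mid H_1]}{\partial \theta_1}\,\frac{\partial \theta_1}{\partial a} = \frac{-C\left(\ln\frac{\theta_1}{\theta_0} + \ln\frac{1 - \theta_0}{1 - \theta_1}\right)}{\left(\theta_1 \ln\frac{\theta_1}{\theta_0} + (1 - \theta_1)\ln\frac{1 - \theta_1}{1 - \theta_0}\right)^2}\; p_{11}.$$

Because $\theta_1 > \theta_0$, we have $\frac{\partial E[N \mid H_1]}{\partial a} < 0$ and consequently $E[N \mid H_1]$ is a decreasing function
of $a$, i.e., increasing the attack probability decreases the average number of required
observations for attack detection. This poses a fundamental trade-off problem to an attacker:
increasing the attack probability $a$ increases the impact of the attack on the target (i.e., lowers
its throughput), but it also enables a detection module to detect the attack sooner. We define the
attacker's cost as the inverse of $E[N \mid H_1]$:

$$\frac{1}{E[N \mid H_1]} = \frac{\theta_1 \ln\frac{\theta_1}{\theta_0} + (1 - \theta_1)\ln\frac{1 - \theta_1}{1 - \theta_0}}{C}.$$

In order to normalize the attacker's cost, we assume that the constant $C$ is equal to the
maximum value of the expression $\theta_1 \ln\frac{\theta_1}{\theta_0} + (1 - \theta_1)\ln\frac{1 - \theta_1}{1 - \theta_0}$, which is attained at $a = 1$, i.e.,
$C = \ln\frac{1}{\theta_0}$. Therefore:

$$\text{Attacker Cost} = \frac{\theta_1 \ln\frac{\theta_1}{\theta_0} + (1 - \theta_1)\ln\frac{1 - \theta_1}{1 - \theta_0}}{\ln\frac{1}{\theta_0}}. \qquad (4)$$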
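As a quick numerical check of the normalized cost, the sketch below evaluates the Kullback-Leibler form above for a given attack probability. The function name `attacker_cost` and the sample value $p_{11} = 0.8$ are illustrative assumptions; the code also assumes $0 < p_{11} < 1$ and uses the convention $0 \ln 0 = 0$ at $a = 1$.

```python
import math


def attacker_cost(a: float, p11: float) -> float:
    """Normalized attacker cost: KL(theta1 || theta0) / ln(1 / theta0)."""
    theta0 = 1.0 - p11                  # failure probability without attack (p10)
    theta1 = 1.0 - p11 * (1.0 - a)      # failure probability under attack

    def xlog(x: float, ratio: float) -> float:
        # Convention 0 * ln(0) = 0, needed when theta1 reaches 1 at a = 1.
        return 0.0 if x == 0.0 else x * math.log(ratio)

    kl = (xlog(theta1, theta1 / theta0)
          + xlog(1.0 - theta1, (1.0 - theta1) / (1.0 - theta0)))
    return kl / math.log(1.0 / theta0)


# Cost grows from 0 (no attack) to 1 (continuous jamming) for p11 = 0.8.
costs = [attacker_cost(i / 10, 0.8) for i in range(11)]
```

The resulting list increases monotonically from 0 to 1, which mirrors the trade-off stated in the text: a more aggressive attacker is detected with fewer observations.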

This definition of the attacker's cost captures the fact that the attacker's cost increases as the
risk of the attack being detected by a detection module increases. Equation (4) is used as the
measure of the attacker's cost in the rest of this paper.

IV. Sensing Policies Analysis in an Adversarial Environment with Two Channels

In this section, we analyze and compare the myopic policy [2] and the softmax policy [14]
in a hostile environment for $N = 2$ identical channels, i.e., $p_{ij}^{(1)} = p_{ij}^{(2)} = p_{ij}$.

A. Analysis of the Myopic Policy

The myopic policy explained in this section is identical to the one presented in [2], but
here we analyze the performance of this policy in an adversarial environment. A
myopic policy ignores the impact of the current action on the future reward and only focuses
on maximizing the immediate reward. At any given time slot $t$, the myopic policy for selecting
a channel for sensing can be expressed as follows:

$$\pi^m(t) = \arg\max_{i=1,\ldots,N} \omega_i(t).$$

In [3], the authors proved that for the channel selection system that was introduced in
Section III-A, the myopic policy is the optimal policy for all $N$. They also showed that the myopic
policy has a simple structure that does not require the knowledge of the transition probabilities
$p_{ij}$ or updates to the belief vector.

We define the steady-state throughput of the myopic policy as:

$$U^m = \lim_{T \to \infty} \frac{V^m_{1:T}(\Omega(1))}{T},$$

where $V^m_{1:T}(\Omega(1))$ is the expected total reward obtained in $T$ slots under the myopic policy when
the initial belief vector is $\Omega(1)$. The key to computing the throughput $U^m$ is to first find how long
a user stays in the same channel. Let us introduce the concept of a transmission period (TP),
which represents the time that a user stays in the same channel. Let $L_k$ denote the $k$-th TP. In
[3], Zhao et al. showed that under the condition $p_{11} > p_{01}$, the steady-state throughput is:

$$U^m = 1 - \frac{1}{\bar{L}}, \qquad (5)$$

where $\bar{L} = \lim_{K \to \infty} \frac{1}{K}\sum_{k=1}^{K} L_k$ denotes the average length of a TP.

Throughput analysis is thus reduced to analyzing the average TP length $\bar{L}$. For $N = 2$, we
can derive a closed-form expression of $\bar{L}$ as a function of the attack probability $a$, which leads
to a closed-form expression of the myopic policy throughput $U^m(a)$.

Theorem 1: For $N = 2$, the average TP length of a myopic policy as a function of the attack
probability, $a$, is given by:

$$\bar{L}^m(a) = 1 + \frac{\bar{\omega}}{1 - p_{11}(1 - a)}, \qquad (6)$$

where

$$\bar{\omega} = \frac{(1 - a)\,p_{01}^{(2)}}{1 + (1 - a)\,p_{01}^{(2)} - A}, \qquad (7)$$

$$A = \omega_0 (1 - a)\left[1 - \frac{(1 - p_{11}(1 - a))(p_{11} - p_{01})^3}{1 - p_{11}(1 - a)(p_{11} - p_{01})}\right], \qquad (8)$$

$$p_{01}^{(2)} = \frac{p_{01} - p_{01}(p_{11} - p_{01})^2}{p_{01} + p_{10}}.$$

Proof: From the structure of the myopic policy, $\{L_k\}_{k=1}^{\infty}$ forms a first-order Markov chain
for $N = 2$. When the system is running in an adversarial environment with attack probability $a$,
the transition probabilities of $\{L_k\}_{k=1}^{\infty}$ are given by

$$r_{ij} = \begin{cases} 1 - p_{01}^{(i+1)}(1 - a), & j = 1 \\ p_{01}^{(i+1)}(1 - a)\,(p_{11}(1 - a))^{j-2}\,(1 - p_{11}(1 - a)), & j \geq 2 \end{cases}$$

where $p_{01}^{(j)}$ is the $j$-step transition probability, which is equal to $\omega_0 - \omega_0(p_{11} - p_{01})^j$. Let
$R = \{r_{ij}\}$ denote the transition matrix of $\{L_k\}_{k=1}^{\infty}$ and let $R(:, k)$ denote the $k$-th column of $R$.
We have

$$\mathbf{1} - R(:, 1) = \frac{R(:, 2)}{1 - p_{11}(1 - a)} \qquad (9)$$

and

$$R(:, k) = R(:, 2)\,(p_{11}(1 - a))^{k-2} \quad \text{for } k > 2, \qquad (10)$$

where $\mathbf{1}$ is the unit column vector $[1, 1, \ldots]^T$. We denote $\Lambda = [\lambda_1, \lambda_2, \ldots]$ as the stationary
distribution of $\{L_k\}_{k=1}^{\infty}$, i.e., $\Lambda R = \Lambda$. Thus we have:

$$[\lambda_1, \lambda_2, \ldots]\,R(:, k) = \lambda_k. \qquad (11)$$

Combining (9), (10) and (11) results in

$$\lambda_1 = 1 - \frac{\lambda_2}{1 - p_{11}(1 - a)}, \qquad \lambda_k = \lambda_2\,(p_{11}(1 - a))^{k-2}. \qquad (12)$$

Substituting (12) into (11) and solving for $\lambda_2$, we get $\lambda_2 = \bar{\omega}(1 - p_{11}(1 - a))$, where $\bar{\omega}$ is given
in (7). From (12), we can find the stationary distribution as

$$\lambda_k = \begin{cases} 1 - \bar{\omega}, & k = 1 \\ \bar{\omega}\,(1 - p_{11}(1 - a))\,(p_{11}(1 - a))^{k-2}, & k > 1 \end{cases} \qquad (13)$$

Using (13) to compute $\bar{L}^m(a) = \sum_{k=1}^{\infty} k\lambda_k$ results in (6). ■

Using the results of Theorem 1, we can show that $U^m(a)$ is a decreasing function of $a$, and
thus an attacker can lower the throughput of the target radio (which is employing myopic sensing)
by increasing the attack probability $a$. However, as we discussed in Section III-C, increasing $a$
increases the probability that the attack is detected.
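The dependence of the myopic throughput on the attack probability can be explored numerically. The sketch below is hedged in two ways. First, it assumes the closed form $\bar{L}^m(a) = 1 + \bar{\omega}/(1 - p_{11}(1 - a))$, with $\bar{\omega}$ and $A$ obtained by substituting the stationary distribution of $\{L_k\}$ back into the stationarity condition in the proof of Theorem 1. Second, it assumes the throughput relation $U = 1 - 1/\bar{L}$ from (5) carries over to the attacked system. The function names and the values $p_{01} = 0.2$, $p_{11} = 0.8$ are illustrative only.

```python
def myopic_tp_length(a: float, p01: float, p11: float) -> float:
    """Average TP length of the myopic policy under attack probability a (N = 2)."""
    p10 = 1.0 - p11
    omega0 = p01 / (p01 + p10)              # stationary idle probability
    delta = p11 - p01                       # assumed positive (p11 > p01)
    r = p11 * (1.0 - a)                     # probability a TP continues one more slot
    p01_2 = omega0 - omega0 * delta ** 2    # two-step transition probability
    big_a = omega0 * (1.0 - a) * (1.0 - (1.0 - r) * delta ** 3 / (1.0 - r * delta))
    omega_bar = (1.0 - a) * p01_2 / (1.0 + (1.0 - a) * p01_2 - big_a)
    return 1.0 + omega_bar / (1.0 - r)


def myopic_throughput(a: float, p01: float, p11: float) -> float:
    """Steady-state throughput U = 1 - 1 / L_bar."""
    return 1.0 - 1.0 / myopic_tp_length(a, p01, p11)


# Throughput for attack probabilities 0.0, 0.1, ..., 1.0 (illustrative channel).
curve = [myopic_throughput(i / 10, 0.2, 0.8) for i in range(11)]
```

For this channel the sketch gives a throughput of 0.65 without an attacker and 0 under continuous jamming, with a monotonically decreasing curve in between.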

B. Analysis of the Softmax Policy 

The softmax action selection policies are randomized policies where, at time $t$, the action $a_t$ is
chosen at random by the user according to some probability distribution giving more weight to
actions which have performed well in the past. The greedy action is given the highest selection
probability, but all the others are ranked and weighted according to their accumulated rewards
[14]. The most common softmax action selection method uses a Gibbs or Boltzmann distribution
for the action selection probabilities. It chooses action $a$ at time slot $t$ with probability

$$\frac{e^{\omega_a(t)/\tau}}{\sum_{i=1}^{N} e^{\omega_i(t)/\tau}},$$

where $\tau$ is a positive parameter called the temperature, which controls the greediness of the policy.
High temperatures cause all the actions to be nearly equiprobable, while low temperatures cause
a high probability for the greedy action, and in the limit as $\tau \to 0$, the softmax policy becomes
equivalent to the myopic policy.

In this section, for simplicity and ease of computation, we use a Bernoulli distribution instead
of a Boltzmann distribution, i.e., we choose action $a$ at time slot $t$ with probability

$$P_a(t) = \begin{cases} q, & \text{if } a = \arg\max_{i=1,2} \omega_i(t) \\ 1 - q, & \text{if } a = \arg\min_{i=1,2} \omega_i(t) \end{cases} \qquad (14)$$

We define the main probability $q$ as the probability of taking the greedy action (i.e., selecting the
channel that has the highest $\omega_i$). According to the definition of $q$ and the softmax policy, we
have $0.5 \leq q \leq 1$ when $N = 2$. Also note that for $q = 1$, the softmax policy reduces to the myopic
policy, i.e., the myopic policy is a special case of the softmax policy.
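Both selection rules can be illustrated with a short sketch. The helper names (`boltzmann_probs`, `select_channel`, `bernoulli_two_channel_probs`) and the sample belief values are assumptions made for this example; subtracting the maximum belief before exponentiating is a standard numerical-stability step and does not change the resulting probabilities.

```python
import math
import random
from typing import List, Sequence


def boltzmann_probs(belief: Sequence[float], tau: float) -> List[float]:
    """Boltzmann (Gibbs) selection probabilities with temperature tau > 0."""
    m = max(belief)
    weights = [math.exp((w - m) / tau) for w in belief]  # shift avoids overflow
    total = sum(weights)
    return [x / total for x in weights]


def select_channel(belief: Sequence[float], tau: float, rng: random.Random) -> int:
    """Draw one channel index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(belief, tau)
    return rng.choices(range(len(belief)), weights=probs, k=1)[0]


def bernoulli_two_channel_probs(belief: Sequence[float], q: float) -> List[float]:
    """Two-channel rule (14): the greedy channel gets q, the other 1 - q."""
    greedy = 0 if belief[0] >= belief[1] else 1
    probs = [1.0 - q, 1.0 - q]
    probs[greedy] = q
    return probs


belief = [0.2, 0.5, 0.8]
cold = boltzmann_probs(belief, tau=0.01)    # nearly greedy (myopic-like)
hot = boltzmann_probs(belief, tau=100.0)    # nearly uniform
choice = select_channel(belief, tau=0.1, rng=random.Random(0))
two = bernoulli_two_channel_probs([0.3, 0.6], q=0.8)
```

At a low temperature almost all probability mass sits on the channel with the largest belief, which is the myopic limit described above, while a high temperature spreads the mass almost evenly.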

The attacker's optimal strategy for attacking a channel selection system employing the myopic
policy is simple: only attack the channel that has the largest belief value, since the user does
not transmit on other channels. The attacker's optimal strategy against the softmax policy is not
so straightforward.

(a) attacker's strategy with division probability d = 1 (b) attacker's strategy with division probability d = 0.( 

Fig. 2. Attack strategy examples for fixed attack probability $a = 0.5$.

As mentioned in Section III-B, the $a$-optimal strategy for an attacker is a strategy that minimizes
the throughput of a cognitive radio while keeping the attack probability $a$ fixed. Knowing that
the softmax policy uses a fixed main probability $q$ for channel selection (i.e., it uses (14) for
channel selection), the attacker divides its attacks between the two channels. We define $d$ as
the conditional probability of channel 1 (the greedy option for the policy) being attacked at a
given time slot, given that the time slot is attacked, and call it the division probability. Figure 2
illustrates the concept of the division probability (the slots colored in gray are the time slots in
which an attacker jams a channel). The attacker chooses $d$ such that the cognitive radio's throughput
is minimized. On the other hand, the channel selection system exploits its knowledge about the
optimal strategy of the attacker to select the main probability $q$ such that the throughput is
maximized. Let $U^s(q, d)$ denote the throughput of the radio when the softmax policy
uses a main probability of $q$ and the attacker uses the division probability $d$. Constructing the
optimal attack strategy and its corresponding optimal channel selection probability distribution
can be formulated as the following optimization problems:
Optimization Problem 1:

$$d^* = \arg\min_{d} U^s(q, d) \quad \text{s.t. } 0 \leq d \leq 1$$

Optimization Problem 2:

$$q^* = \arg\max_{q} U^s(q, d^*) \quad \text{s.t. } 0.5 \leq q \leq 1$$

The steady-state throughput of the softmax policy is given by:

$$U^s = \lim_{T \to \infty} \frac{V^s_{1:T}(\Omega(1))}{T},$$

where $V^s_{1:T}(\Omega(1))$ is the expected total reward obtained in $T$ slots under the softmax policy when
the initial belief vector is $\Omega(1)$. As shown in Section IV-A, we only need the average TP length
to compute the throughput for the softmax policy. Suppose that the attacker uses its optimal
strategy. An analysis similar to the one in Section IV-A results in the following theorem.

Theorem 2: For $N = 2$, the average TP length for a softmax policy with main
probability $q$ is given by:

$$\bar{L}^s(q, d) = q\,\bar{L}^m(ad) + (1 - q)\,\bar{L}^n(a(1 - d)),$$

where $\bar{L}^m$ is the function defined in (6) and

$$\bar{L}^n(x) = 1 + \frac{p_{01}(1 - x)}{1 - p_{11}(1 - x)}.$$

Proof: Using a procedure similar to the one used in the proof of Theorem 1, we can readily
show that if the channel selection algorithm always selects the channel with the smaller belief
value (which is the opposite of a greedy action), then the stationary distribution of $\{L_k\}_{k=1}^{\infty}$ is

$$\mu_k = \begin{cases} 1 - p_{01}(1 - a), & k = 1 \\ p_{01}(1 - a)\,(1 - p_{11}(1 - a))\,(p_{11}(1 - a))^{k-2}, & k > 1 \end{cases} \qquad (15)$$

Using (15) to compute $\bar{L}^n(a) = \sum_{k=1}^{\infty} k\mu_k$ results in

$$\bar{L}^n(a) = 1 + \frac{p_{01}(1 - a)}{1 - p_{11}(1 - a)}.$$

By using Bayes' rule and the statement of Theorem 1, we can obtain the statement of
Theorem 2. ■
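Because the two-channel problem has only two scalar decision variables, the attacker's and the defender's problems can be explored by a brute-force grid search. The sketch below rests on several assumptions that should be checked against the full paper: the closed form of $\bar{L}^m$ is the one reconstructed from the proof of Theorem 1, the throughput is taken as $U = 1 - 1/\bar{L}^s$, and the defender is assumed to maximize against the attacker's best response to each $q$ (a max-min reading of Optimization Problems 1 and 2). All function names and numeric values are hypothetical.

```python
def tp_len_greedy(x: float, p01: float, p11: float) -> float:
    """Average TP length when the greedy channel is jammed with probability x."""
    omega0 = p01 / (p01 + (1.0 - p11))
    delta = p11 - p01
    r = p11 * (1.0 - x)
    p01_2 = omega0 * (1.0 - delta ** 2)
    big_a = omega0 * (1.0 - x) * (1.0 - (1.0 - r) * delta ** 3 / (1.0 - r * delta))
    omega_bar = (1.0 - x) * p01_2 / (1.0 + (1.0 - x) * p01_2 - big_a)
    return 1.0 + omega_bar / (1.0 - r)


def tp_len_nongreedy(x: float, p01: float, p11: float) -> float:
    """Average TP length when the non-greedy channel is jammed with probability x."""
    return 1.0 + p01 * (1.0 - x) / (1.0 - p11 * (1.0 - x))


def softmax_throughput(q: float, d: float, a: float, p01: float, p11: float) -> float:
    """Throughput 1 - 1/L_s with L_s = q*L_m(a*d) + (1-q)*L_n(a*(1-d))."""
    length = (q * tp_len_greedy(a * d, p01, p11)
              + (1.0 - q) * tp_len_nongreedy(a * (1.0 - d), p01, p11))
    return 1.0 - 1.0 / length


def best_division(q: float, a: float, p01: float, p11: float, steps: int = 100) -> float:
    """Attacker's grid-search best response d* for a given main probability q."""
    grid = [j / steps for j in range(steps + 1)]
    return min(grid, key=lambda d: softmax_throughput(q, d, a, p01, p11))


def best_main_probability(a: float, p01: float, p11: float, steps: int = 50):
    """Defender's grid search over q in [0.5, 1]; returns (throughput, q*, d*)."""
    best = None
    for i in range(steps + 1):
        q = 0.5 + 0.5 * i / steps
        d = best_division(q, a, p01, p11)
        u = softmax_throughput(q, d, a, p01, p11)
        if best is None or u > best[0]:
            best = (u, q, d)
    return best


u_clean, q_clean, _ = best_main_probability(0.0, 0.2, 0.8)
u_attacked, q_attacked, d_attacked = best_main_probability(0.5, 0.2, 0.8)
u_myopic_attacked = 1.0 - 1.0 / tp_len_greedy(0.5, 0.2, 0.8)
```

With these illustrative parameters the search returns $q^* = 1$ when there is no attacker, and for $a = 0.5$ the randomized policy attains a visibly higher throughput than the myopic policy facing an attacker who always jams the greedy channel.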

To quantify how much randomness is added to the channel selection system by the softmax
policy, we use the entropy of the channel selection probability distribution, $\mathcal{H}$. For the $N = 2$
case, because we use a Bernoulli distribution, $\mathcal{H} = -(q\ln(q) + (1 - q)\ln(1 - q))$. By changing $q$
from 1 to 0.5, the entropy increases from 0 to its maximum value $\ln(2)$.
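The entropy measure is simple enough to check directly. The following sketch, with the hypothetical helper name `selection_entropy`, evaluates it in nats and uses the convention $0 \ln 0 = 0$ at the endpoints.

```python
import math


def selection_entropy(q: float) -> float:
    """Entropy (in nats) of the two-channel selection distribution (q, 1 - q)."""
    if q <= 0.0 or q >= 1.0:
        return 0.0  # convention: 0 * ln(0) = 0
    return -(q * math.log(q) + (1.0 - q) * math.log(1.0 - q))


# Entropy for q = 1.0, 0.9, ..., 0.5: grows from 0 up to ln(2).
entropies = [selection_entropy(1.0 - i / 10) for i in range(6)]
```

The values increase as $q$ moves from 1 toward 0.5, so a larger entropy corresponds to a less greedy, more randomized policy.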

We denote $U^{\pi}(a)$ as the throughput of the cognitive radio when it uses policy $\pi$ for channel
selection and the attacker uses its $a$-optimal strategy. We use the following definitions to quantify
the robustness and the performance of policy $\pi$:

Definition 2: The robustness of a policy $\pi$ for a channel selection system under an $a$-optimal
attack is: R-{a) = 1 - ™^(^. 

Definition 3: The performance of a policy $\pi$ for a channel selection system under an $a$-optimal
attack is: $P^{\pi} = U^{\pi}(0)$.

C. Numerical Results for two Channels 

Using the results of Theorem 2, it can easily be shown that for $a = 0$, $q^* = 1$, i.e., in a
non-adversarial environment the optimal softmax policy is equivalent to the myopic policy.
Solving Optimization Problems 1 and 2 for other values of $a$ is straightforward due to the
problem's small solution space. Figure 3 shows the solutions to this problem for a range of
attack probabilities. Because myopic sensing is a special case of softmax sensing, it is obvious
that $U^s \geq U^m$. The numerical results illustrated in Figure 3 show that, except for small values of $a$
($a < 0.1$), $U^s$ is strictly greater than $U^m$, i.e., softmax sensing outperforms myopic sensing.

Figure 4 shows the trade-off between the robustness and the performance of the system for
fixed attack probability $a = 0.5$. As can be seen, increasing the randomness that is used in
the system by the softmax policy increases the policy's robustness but decreases the
policy's performance when no attacker exists.

Fig. 3. Throughput vs. attack probability for $N = 2$.

Fig. 4. Performance and robustness vs. randomness for $N = 2$.

V. Sensing Policies with More than Two Channels in an Adversarial Environment

As mentioned in [2], because $\{L_k\}$ is a random process with higher-order memory, obtaining
a closed-form expression of throughput for $N > 2$ channels with different statistics is very
difficult. Nevertheless, we can show that a softmax policy that uses a Boltzmann distribution
with an intelligent choice for the temperature, $\tau$, can outperform the myopic policy for all possible
strategies of the attacker, including its optimal strategy.

Assume that the channel selection system selects one of the $N$ channels at each time slot.
When a channel is selected, the system keeps using that channel until a primary user begins to
transmit on the channel. Suppose $\Omega(t) = (\omega_1(t), \omega_2(t), \ldots, \omega_N(t))$ is the belief vector of the
system at time $t$ and $Q(t) = (q_1(t), q_2(t), \ldots, q_N(t))$ is the corresponding probability vector,
i.e., $q_i(t) = f_i(\Omega(t))$ is the probability of selecting channel $i$ at time $t$, where $f(\cdot)$ is a function
that maps the belief vector into a probability vector. For the sake of notational simplicity, we omit
$t$ in the belief and probability vectors henceforth. Without loss of generality, assume that
$\omega_N \geq \omega_{N-1} \geq \cdots \geq \omega_1$ and consequently $q_N \geq q_{N-1} \geq \cdots \geq q_1$. As we mentioned in Section III-B,
the $a$-optimal strategy for the attacker is a strategy that minimizes the throughput while keeping
the attack probability $a$ fixed. In order to do so, the attacker needs to attack channel $i$ with
probability $a d_i$, where $d_i$ is the division probability for channel $i$ (i.e., the conditional probability
of channel $i$ being attacked at a given time slot, given that the time slot is attacked), which is a
function of $Q$ and $\Omega$. Also, to keep $a$ fixed, we need to have $\sum_{i=1}^{N} d_i = 1$. The following theorem
states the optimization problem that the attacker needs to solve to find its optimal strategy.

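Theorem 3: In order to find its $\alpha$-optimal strategy, the attacker needs to solve the following equality-constrained convex optimization problem:
Optimization Problem 3.

$$\min_{d_1, \dots, d_N} \; \sum_{i=1}^{N} q_i \, L(\omega_i, \alpha d_i) \quad \text{s.t.} \quad \sum_{i=1}^{N} d_i = 1, \quad \forall i : d_i \ge 0,$$

where $L(\omega_i, \alpha d_i)$ is the expected length of the TP when channel $i$, with belief $\omega_i$, is selected and attacked with probability $\alpha d_i$.
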
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Proof: In order to minimize the throughput, attacker needs to minimize the expected value 
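for $L_k(\omega)$. Suppose that channel $i$ is selected. We know that the attacker attacks this channel with probability $\alpha d_i$, so we have:

$$\Pr\{L_k(\omega) = l \mid \text{channel } i\} = \begin{cases} 1 - \omega_i(1 - \alpha d_i), & l = 1 \\ \omega_i(1 - \alpha d_i)\,\bigl(p_{11}(1 - \alpha d_i)\bigr)^{l-2}\, p_{10}, & l > 1 \end{cases}$$

It can easily be shown that

$$E[L_k(\omega) \mid \text{channel } i] = 1 + \frac{\omega_i(1 - \alpha d_i)}{1 - p_{11}(1 - \alpha d_i)}.$$

So we have:

$$E[L_k(\omega)] = E\bigl[E[L_k(\omega) \mid \text{channel } i]\bigr] = \sum_{i=1}^{N} q_i \left(1 + \frac{\omega_i(1 - \alpha d_i)}{1 - p_{11}(1 - \alpha d_i)}\right).$$

To prove that $E[L_k(\omega)]$ is a convex function of the $d_i$'s, we compute the Hessian matrix of this function: $H_{N \times N} = [h_{ij}]$. We can see that the Hessian of this function is a diagonal matrix where:

$$h_{ii} = \frac{2 q_i \omega_i \alpha^2 p_{11}}{\bigl(1 - p_{11}(1 - \alpha d_i)\bigr)^3}.$$

It is obvious that $\forall i : h_{ii} \ge 0$, thus the matrix $H$ is positive semidefinite and, as a result, $E[L_k(\omega)]$ is a convex function. ■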
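As a sanity check on the convexity argument, the diagonal Hessian entries can be compared against finite differences of the objective. The sketch below assumes the expected TP length has the form $\sum_i q_i \bigl(1 + \omega_i(1-\alpha d_i)/(1 - p_{11}(1-\alpha d_i))\bigr)$ with the same $p_{11}$ for every channel; the function names and the numeric inputs are hypothetical.

```python
def expected_tp_length(q, omega, d, alpha, p11):
    """Objective of the attacker: sum_i q_i * (1 + w_i*x_i / (1 - p11*x_i)), x_i = 1 - alpha*d_i."""
    total = 0.0
    for qi, wi, di in zip(q, omega, d):
        x = 1.0 - alpha * di
        total += qi * (1.0 + wi * x / (1.0 - p11 * x))
    return total

def hessian_diagonal(q, omega, d, alpha, p11):
    """Closed-form diagonal entries h_ii = 2*q_i*w_i*alpha^2*p11 / (1 - p11*(1 - alpha*d_i))^3."""
    return [2.0 * qi * wi * alpha ** 2 * p11 / (1.0 - p11 * (1.0 - alpha * di)) ** 3
            for qi, wi, di in zip(q, omega, d)]

def numeric_second_derivative(q, omega, d, alpha, p11, i, h=1e-4):
    """Central finite-difference estimate of the second derivative with respect to d_i."""
    up, down = list(d), list(d)
    up[i] += h
    down[i] -= h
    f0 = expected_tp_length(q, omega, d, alpha, p11)
    f_up = expected_tp_length(q, omega, up, alpha, p11)
    f_down = expected_tp_length(q, omega, down, alpha, p11)
    return (f_up - 2.0 * f0 + f_down) / h ** 2

# Illustrative inputs (not from the paper).
q = [0.1, 0.2, 0.3, 0.4]
omega = [0.3, 0.4, 0.5, 0.6]
d = [0.25, 0.25, 0.25, 0.25]
for i, h_ii in enumerate(hessian_diagonal(q, omega, d, 0.5, 0.9)):
    print(i, h_ii, numeric_second_derivative(q, omega, d, 0.5, 0.9, i))
```

Because each term of the objective depends on a single $d_i$, all mixed second derivatives vanish, which is why only the diagonal needs to be checked.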

The above equality-constrained convex optimization problem can be solved by using elimination and Newton's method in a number of steps that is polynomial in $N$ [16]. However, solving this optimization problem requires the attacker to have exact knowledge of the channel selection strategy of the system and of the statistics of all channels over a long period of time. Acquiring such knowledge is not feasible under most circumstances. Therefore, instead of using the $\alpha$-optimal strategy, the attacker can use alternative, simpler strategies. These strategies differ in the amount of knowledge about the channel selection system that they require. These strategies are:

• Greedy Strategy: The attacker only knows the best channel for transmission at each time slot and attacks this channel, i.e., $d_N = 1$ and $d_i = 0$ for $1 \le i \le N-1$. This strategy is the optimal attack strategy when the channel selection system uses a myopic policy.

• Uniform Strategy: The adversary attacks all channels equiprobably, i.e., $d_i = \frac{1}{N}$ for $1 \le i \le N$. This strategy can be used when the attacker does not have any knowledge about the channel selection system.

• $\Omega$ Strategy: The attacker only knows the channel statistics $\Omega = (\omega_1, \dots, \omega_N)$ and has no knowledge about the channel selection policy of the system. The division probabilities follow a Boltzmann distribution with an arbitrary temperature $\tau_a$, i.e., $d_i = \frac{e^{\omega_i/\tau_a}}{\sum_{j=1}^{N} e^{\omega_j/\tau_a}}$ for $1 \le i \le N$. Simulation results show that when the attack probability, $\alpha$, is large, this strategy has approximately the same effect as the $\alpha$-optimal strategy.

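The three simpler strategies listed above can each be written as a division vector $(d_1, \dots, d_N)$. The following is a minimal sketch; the function names, the attacker temperature argument `tau_a`, and the belief values are hypothetical.

```python
import math

def greedy_division(beliefs):
    """Greedy strategy: all attack probability on the channel with the largest belief."""
    best = max(range(len(beliefs)), key=lambda i: beliefs[i])
    return [1.0 if i == best else 0.0 for i in range(len(beliefs))]

def uniform_division(beliefs):
    """Uniform strategy: every channel is attacked with the same probability 1/N."""
    n = len(beliefs)
    return [1.0 / n] * n

def omega_division(beliefs, tau_a):
    """Omega strategy: Boltzmann distribution over the beliefs with attacker temperature tau_a."""
    m = max(beliefs)  # subtract the maximum for numerical stability
    weights = [math.exp((w - m) / tau_a) for w in beliefs]
    total = sum(weights)
    return [w / total for w in weights]

beliefs = [0.3, 0.4, 0.5, 0.6]  # illustrative beliefs, sorted so the last channel is the best
print("greedy ", greedy_division(beliefs))
print("uniform", uniform_division(beliefs))
print("omega  ", [round(x, 3) for x in omega_division(beliefs, 0.2)])
```

The $\Omega$ strategy interpolates between the other two: a very small `tau_a` approaches the greedy division and a very large `tau_a` approaches the uniform one.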
In order to find the best channel selection strategy, the system assumes the worst-case attack scenario, in which the attacker uses its $\alpha$-optimal strategy, i.e., the solution to Optimization Problem 3: $d^{*} = f^{*}(Q, \Omega)$. The channel selection system needs to solve the following optimization problem in order to maximize the cognitive radio's throughput.
Optimization Problem 4.

$$\max_{Q} \; \sum_{i=1}^{N} q_i \, L\bigl(\omega_i, \alpha f_i^{*}(Q, \Omega)\bigr) \quad \text{s.t.} \quad \sum_{i=1}^{N} q_i = 1, \quad \forall i : q_i \ge 0,$$

where $f_i^{*}(Q, \Omega)$ denotes the $i$-th component of $d^{*}$.

Finding the globally optimal solution of the nonlinear Optimization Problem 4 gives us the amount of randomness that we need to add to the system in order to minimize the attack's effect and maximize the throughput.
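The two nested problems can be explored numerically with a short sketch. Several assumptions are made here and none of them should be read as the method used in the paper: the system's choice of $Q$ is restricted to the Boltzmann family and searched over a grid of temperatures, the attacker's inner problem is solved by bisection on the KKT multiplier (a water-filling argument that applies because the objective is separable and convex) rather than by elimination and Newton's method, all channels share the same $p_{11}$, and the expected TP length $\sum_i q_i \bigl(1 + \omega_i(1-\alpha d_i)/(1 - p_{11}(1-\alpha d_i))\bigr)$ serves as the throughput measure. All names and numbers are hypothetical.

```python
import math

def softmax_probs(beliefs, tau):
    m = max(beliefs)
    weights = [math.exp((w - m) / tau) for w in beliefs]
    total = sum(weights)
    return [w / total for w in weights]

def expected_tp_length(q, omega, d, alpha, p11):
    total = 0.0
    for qi, wi, di in zip(q, omega, d):
        x = 1.0 - alpha * di
        total += qi * (1.0 + wi * x / (1.0 - p11 * x))
    return total

def alpha_optimal_division(q, omega, alpha, p11, iters=200):
    """Minimize expected_tp_length over {d >= 0, sum(d) = 1} via bisection on the multiplier mu."""
    n = len(q)
    if alpha <= 0.0:
        return [1.0 / n] * n  # without attacks every division gives the same objective

    def division(mu):
        # Stationarity: alpha*q_i*w_i / (1 - p11*(1 - alpha*d_i))^2 = mu, clipped at d_i = 0.
        return [max(0.0, (math.sqrt(alpha * qi * wi / mu) - (1.0 - p11)) / (p11 * alpha))
                for qi, wi in zip(q, omega)]

    hi = max(alpha * qi * wi for qi, wi in zip(q, omega)) / (1.0 - p11) ** 2
    lo = hi * 1e-12
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint, since mu spans many orders of magnitude
        if sum(division(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    d = division(hi)
    s = sum(d)
    return [x / s for x in d]  # remove the residual bisection error

def best_temperature(omega, alpha, p11, taus):
    """Grid search: temperature maximizing the worst-case (alpha-optimal attack) expected TP length."""
    best = None
    for tau in taus:
        q = softmax_probs(omega, tau)
        d = alpha_optimal_division(q, omega, alpha, p11)
        value = expected_tp_length(q, omega, d, alpha, p11)
        if best is None or value > best[1]:
            best = (tau, value)
    return best

omega = [0.3, 0.4, 0.5, 0.6]  # illustrative beliefs
print(best_temperature(omega, alpha=0.5, p11=0.9, taus=[0.05, 0.1, 0.2, 0.5, 1.0, 2.0]))
```

A grid over $\tau$ only approximates the solution of Optimization Problem 4, since the true feasible set is the whole probability simplex; it is used here because it matches the one-parameter family plotted against the attack probability later in this section.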

We can also show that for all possible attack strategies, including the $\alpha$-optimal strategy, a softmax policy that uses a Boltzmann distribution with a well-chosen temperature outperforms the myopic policy.
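The comparison behind this claim can be spot-checked numerically. The proof of Theorem 4 compares the attack-discounted belief of the softmax policy, $\sum_i q_i \omega_i (1 - \alpha d_i)$, with that of the myopic policy, $\omega_N(1-\alpha)$. The sketch below evaluates both quantities under a greedy attack; the function names, belief values, and temperature are hypothetical.

```python
import math

def softmax_probs(beliefs, tau):
    m = max(beliefs)
    weights = [math.exp((w - m) / tau) for w in beliefs]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_effective_belief(beliefs, division, alpha, tau):
    """sum_i q_i * omega_i * (1 - alpha * d_i) for Boltzmann selection probabilities q_i."""
    q = softmax_probs(beliefs, tau)
    return sum(qi * wi * (1.0 - alpha * di)
               for qi, wi, di in zip(q, beliefs, division))

def myopic_effective_belief(beliefs, alpha):
    """omega_N * (1 - alpha): the myopic policy always uses the best channel, which every attack hits."""
    return max(beliefs) * (1.0 - alpha)

beliefs = [0.3, 0.4, 0.5, 0.6]   # illustrative beliefs
greedy = [0.0, 0.0, 0.0, 1.0]    # attacker concentrates on the best channel
for alpha in (0.0, 0.5, 0.9, 1.0):
    print(alpha,
          round(softmax_effective_belief(beliefs, greedy, alpha, tau=0.5), 4),
          round(myopic_effective_belief(beliefs, alpha), 4))
```

With no attack the myopic value is the larger of the two, and as the attack probability approaches 1 the myopic value falls to zero while the softmax value stays positive, which is the trade-off the theorem formalizes.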

Theorem 4: When the channel selection system consists of more than two identical channels, for all attack strategies that the adversary may employ with a fixed attack probability $\alpha > \frac{N(\omega_0 - p_{01})}{N\omega_0 - p_{01}}$, a softmax policy that uses a Boltzmann distribution with temperature

$$\tau = \frac{\omega_0 - p_{01}}{\ln \dfrac{p_{01}(N - \alpha)}{N \omega_0 (1 - \alpha)}} \tag{16}$$

achieves a greater throughput than a myopic policy.

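Proof: Let $\omega_N \ge \omega_{N-1} \ge \dots \ge \omega_1$ denote the belief values of all channels in the first slot of the $k$-th TP. For the myopic policy, the length $L_k$ of this TP has the following distribution:

$$\Pr\{L_k(\omega_N) = l\} = \begin{cases} 1 - \omega_N(1 - \alpha), & l = 1 \\ \omega_N(1 - \alpha)\,\bigl(p_{11}(1 - \alpha)\bigr)^{l-2}\, p_{10}, & l > 1 \end{cases}$$

For the softmax policy, the length $L_k$ of this TP has the following distribution:

$$\Pr\{L_k(\bar{\omega}) = l\} = \begin{cases} 1 - \bar{\omega}, & l = 1 \\ \bar{\omega}\,\bigl(p_{11}(1 - \alpha \bar{d})\bigr)^{l-2}\, p_{10}, & l > 1 \end{cases}$$

where $\bar{\omega} = \sum_{i=1}^{N} q_i \omega_i (1 - \alpha d_i)$ and $\bar{d} = \sum_{i=1}^{N} q_i d_i$, in which $q_i = \frac{e^{\omega_i/\tau}}{\sum_{j=1}^{N} e^{\omega_j/\tau}} = \frac{1}{\sum_{j=1}^{N} e^{(\omega_j - \omega_i)/\tau}}$ are the action selection probabilities that follow a Boltzmann distribution. It is readily observable that if $\bar{\omega} \ge \omega_N(1 - \alpha)$, then $L_k(\bar{\omega})$ stochastically dominates $L_k(\omega_N)$ and consequently the throughput of the softmax policy is greater than the throughput of the myopic policy. We use the fact that $\forall i,\ p_{01} \le \omega_i \le \omega_0$, which results in

$$\frac{1}{N e^{(\omega_0 - p_{01})/\tau}} \le q_i \le \frac{1}{N e^{(p_{01} - \omega_0)/\tau}}, \tag{17}$$

and also the fact that

$$q_N \ge \frac{1}{N} \quad \text{and} \quad d_N \ge \frac{1}{N}. \tag{18}$$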



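Now we show that $\bar{\omega} \ge \omega_N(1 - \alpha)$. Using (18) and (17), we have:

$$\begin{aligned}
\bar{\omega} &= \omega_N q_N (1 - \alpha d_N) + \sum_{i=1}^{N-1} q_i \omega_i (1 - \alpha d_i) \\
&\ge \omega_N q_N (1 - \alpha d_N) + \sum_{i=1}^{N-1} p_{01}\, q_i (1 - \alpha d_i) \\
&\ge \omega_N q_N (1 - \alpha d_N) + \sum_{i=1}^{N-1} p_{01}\, \frac{1}{N e^{(\omega_0 - p_{01})/\tau}}\, (1 - \alpha d_i) \\
&= \omega_N q_N (1 - \alpha d_N) + \frac{p_{01}\, N \omega_0 (1 - \alpha)}{N\, p_{01} (N - \alpha)}\, \bigl(N - 1 - \alpha(1 - d_N)\bigr) \\
&\ge \omega_N \Bigl( q_N (1 - \alpha d_N) + \frac{1 - \alpha}{N - \alpha}\, \bigl(N - 1 - \alpha(1 - d_N)\bigr) \Bigr) \\
&\ge \omega_N (1 - \alpha).
\end{aligned}$$

■

Fig. 5. Throughput vs. attack probability for N = 4.
Fig. 6. Throughput vs. attack probability for N = 10.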



A. Numerical Results for More than Two Channels 

We simulated the channel selection system to evaluate the performance of myopic sensing and softmax sensing for more than two channels. Figures 5 and 6 show the throughput of the channel selection system versus the attack probability for myopic and softmax policies when 4 channels and 10 channels are available, respectively. To observe the effect of the number of channels on our scheme, we used identical channels with transition probabilities $p_{11} = 0.9$, $p_{10} = 0.1$, $p_{00} = 0.8$ and $p_{01} = 0.2$ to obtain the results in Figures 5 and 6. Softmax policies in these figures use a Boltzmann distribution with fixed temperature $\tau = 2$ for channel selection.

TABLE I
Transition Probabilities

             p11     p10     p00     p01
Channel 1    0.9     0.1     0.8     0.2
Channel 2    0.95    0.05    0.8     0.2
Channel 3    0.9     0.1     0.85    0.15
Channel 4    0.95    0.05    0.85    0.15

As can be seen, the throughput drop caused by the increase in the attack probability is more severe for the myopic policy. Comparing the slopes of the curves in Figures 5 and 6, we can see that increasing the number of channels from 4 to 10 makes the softmax policy more robust against belief manipulation attacks. In Figure 6, the softmax policy's drop in throughput is barely noticeable even as the attack probability is increased to 0.5.

Figures 5 and 6 also show that in a non-adversarial environment (i.e., when $\alpha = 0$), the myopic policy performs better than the softmax policy. These figures confirm the findings of [2], in which the authors showed that the optimal policy in non-adversarial environments is the myopic policy.

To compare different attack strategies and to investigate the amount of randomness required in the channel selection system, we used the more realistic model of non-identical channels with different transition probabilities. Figure 7 shows the effectiveness of the different attack strategies for different attack probabilities in a system of 4 non-identical channels with the transition probabilities shown in Table I.

As can be seen in Figure 7, the $\Omega$ strategy's performance is close to that of the $\alpha$-optimal strategy for large values of $\alpha$. We can also see from the figure that the greedy strategy is the least effective of the four strategies.

Fig. 7. Comparing different attack strategies.
Fig. 8. Optimal amount of randomness vs. attack probability.
Fig. 9. Attacker's cost vs. attack probability.

It is well known that the temperature of a Boltzmann distribution, $\tau$, is a measure of the amount of randomness in the distribution, which can be inferred from the fact that the entropy of the Boltzmann distribution is an increasing function of $\tau$ [21]. Figure 8 plots the optimal temperature value for the channel selection system when the attacker uses the $\alpha$-optimal strategy. These temperature values were obtained by solving Optimization Problem 4. From the figure, we can observe that the channel selection system needs to increase the randomness (i.e., $\tau$) in its learning process as the attack probability (i.e., $\alpha$) increases in order to minimize the effect of the attack.
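The link between temperature and randomness can be illustrated by computing the Shannon entropy of a Boltzmann distribution at several temperatures. This is a small sketch with hypothetical function names and illustrative belief values.

```python
import math

def softmax_probs(beliefs, tau):
    m = max(beliefs)
    weights = [math.exp((w - m) / tau) for w in beliefs]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

beliefs = [0.2, 0.4, 0.6, 0.8]  # illustrative beliefs
for tau in (0.1, 0.5, 2.0, 10.0):
    print(tau, round(entropy(softmax_probs(beliefs, tau)), 4))
print("upper bound ln(N):", round(math.log(len(beliefs)), 4))
```

The entropy grows with the temperature and approaches $\ln N$, the entropy of uniform channel selection, as the temperature becomes large.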

We also observed the effect of increasing the attack probability on the performance of the channel selection system. Figure 9 plots the attacker's cost, computed using equation (4), versus the attack probability. This figure shows that although increasing the attack probability significantly reduces the performance of the channel selection system, it also increases the attacker's cost at a high rate.

VI. Conclusion 

In this paper, we analyzed the security of a reinforcement learning algorithm that is used 
for solving the channel selection problem. We proposed a sensing policy that uses some level 
of randomness in the decision function to hide information about the learning algorithm. The 
obtained theoretical and simulation results show that the proposed mitigation technique can cause 
a dramatic improvement in the robustness of the channel selection process to adaptive jamming 
attacks. This countermeasure is applicable to other cognitive radio applications that use the same 
type of machine learning algorithm. 
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